MIT ADSP Practical Data Science¶

Predicting Loan Defaults with Machine Learning¶

Prepared by Andria Koen

3 March, 2025

Executive Summary¶

This project developed and optimized a classification model to predict loan defaulters, with the goal of minimizing financial risk for lending institutions. Using the HMEQ dataset, a Tuned Random Forest with Class Weights achieved 78% recall, significantly reducing false negatives. Key steps included data preprocessing, model selection, and hyperparameter tuning. The model can be integrated into the loan approval process to enhance risk assessment, reduce losses from defaults, and improve decision-making; future steps include further threshold tuning and ongoing model monitoring.

Context & Business Impact¶

Loan defaults pose a significant financial risk to lending institutions, leading to direct monetary losses, increased administrative costs, and reputational damage. High default rates also indicate potential inefficiencies in loan approval and risk assessment frameworks. Accurate default prediction is therefore crucial for efficient risk assessment and sustainable growth: a reliable model enables proactive risk management, reducing default rates and enhancing profitability.

Project Objective¶

This project aims to develop and implement a supervised machine learning model capable of predicting the likelihood of loan defaults for new applicants. By integrating this model into the loan approval workflow, lenders can proactively manage risk and reduce default rates. The primary performance metric is achieving at least 80% recall in predicting defaults within the first year of loan issuance, ensuring minimal false negatives and improving loan approval decisions.

Key Findings & Methodology¶

  • Data Utilization: The model is trained on the Home Equity dataset (HMEQ), which includes 5,960 loan records and 12 input variables such as loan amount, mortgage due, applicant job type, credit history, and debt-to-income ratio.
  • Feature Importance: Variables such as credit score, delinquency history, debt-to-income ratio, and number of derogatory reports were found to be strong predictors of loan default.
  • Machine Learning Models Evaluated: The project tested multiple classification algorithms, including Logistic Regression, Decision Trees, and Random Forest.
  • Best Performing Model: Random Forest emerged as the most effective model, balancing recall, precision, and overall accuracy.
  • Data Challenges Addressed: Handling missing values, outliers, and data imbalance through preprocessing techniques such as scaling, imputation, and resampling.
  • Model Optimization: Hyperparameter tuning using GridSearchCV improved model performance, leading to higher recall while maintaining precision and accuracy.
  • Initial Model Selection: Candidates included Logistic Regression, Decision Trees, and Random Forests. The default Random Forest achieved the highest train accuracy (1.0) but showed signs of overfitting.
  • Tuned Random Forest: Hyperparameter tuning improved generalization, with a balanced trade-off between recall and precision. Class imbalance was a key challenge, leading to the introduction of weighted models and threshold tuning.
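The GridSearchCV tuning step described above can be sketched as follows. The exact parameter grid searched is not reproduced in this summary, so the values below are illustrative assumptions; scoring is set to recall to match the project's primary metric.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical parameter grid -- the actual values searched may differ
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [4, 6, None],
    "class_weight": [None, "balanced"],  # class weights to counter imbalance
}

rf = RandomForestClassifier(random_state=1)

# Optimize for recall, the project's primary metric, via 5-fold cross-validation
grid = GridSearchCV(rf, param_grid, scoring="recall", cv=5, n_jobs=-1)
# grid.fit(X_train, y_train)      # fit on the training split
# best_rf = grid.best_estimator_  # tuned model used for final evaluation
```

The `fit` call is left commented because it depends on the train/test split produced earlier in the notebook.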

Recommendations for Implementation & Business Impact¶

Integrate the model into the loan approval process for real-time risk assessment. By improving risk assessment, financial institutions can reduce losses due to defaults and enhance profitability, while the model's data-driven approach to credit decisions supports fairness and transparency in lending practices.

Detailed Model Prediction Analysis:

The following metrics highlight the model's performance and its potential impact on loan default prediction:

Metric                       | Value       | Description
Total Loans Analyzed         | 5,960       | Total number of loans evaluated by the model.
True Positives (TP)          | 280         | Loans correctly predicted to default.
False Negatives (FN)         | 77          | Actual defaults the model missed (predicted as non-default).
False Positives (FP)         | 155         | Non-defaulting loans incorrectly flagged as defaults.
Estimated Default Rate       | 13%         | Predicted default rate after applying the model.
Reduction in Default Rate    | 7 pct. pts. | Difference between the baseline (20%) and estimated default rates.
Potential Defaults Prevented | 417         | Estimated defaults prevented through model-guided intervention.
Potential Loss Reduction     | $7,760,736  | Estimated savings from prevented defaults (average loan ≈ $18,608).

This table demonstrates the effectiveness of the loan default prediction model in reducing risk and potential financial losses. The model is predicted to decrease the loan default rate by 7 percentage points (from the 20% baseline to an estimated 13%), resulting in potential savings of over $7.76 million.
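The loss-reduction estimate is essentially the number of prevented defaults multiplied by the average loan amount. A quick sanity check with the table's rounded average (the table's exact dollar figure was presumably computed from unrounded inputs, so the result below is close to, but not identical with, $7,760,736):

```python
avg_loan = 18_608          # rounded average loan amount from the summary statistics
defaults_prevented = 417   # estimated defaults prevented (from the table above)

estimated_savings = defaults_prevented * avg_loan
print(f"Estimated loss reduction: ${estimated_savings:,}")
```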

Risks and Challenges¶

  • Data and Model Issues: Changes in data, biases, and security concerns can impact model accuracy and fairness.
  • Operational Challenges: Implementation, maintenance, and regulatory compliance pose significant hurdles.

Next Steps¶

  • Model Optimization: Fine-tune the model, explore new features, and compare against alternatives for better performance.
  • Practical Implementation: Deploy, monitor, and refine the model through error analysis and explainability techniques.
  • Business Alignment: Adjust the model for cost sensitivity and ensure it meets regulatory and business needs.

Conclusion¶

This machine learning-based approach to predicting loan defaults gives financial institutions a powerful tool for enhancing risk assessment and optimizing lending decisions. With further refinement and real-time deployment, the model can significantly improve profitability and operational efficiency in loan management.

Data Overview¶

Problem Definition: Predicting Loan Defaults¶

The Context

Loan defaults represent a significant financial risk to lending institutions, directly impacting profitability and operational efficiency. Defaulting loans lead to direct financial losses, increased administrative costs associated with collections and legal proceedings, and potential damage to the institution's reputation. Furthermore, high default rates can signal underlying issues within the loan approval process and risk assessment framework. In the current competitive lending environment, accurate risk assessment and proactive default prediction are crucial for sustainable growth and market stability.

The Objective

The primary objective of this project is to develop and implement a supervised machine learning model capable of accurately predicting the likelihood of loan default for new applicants. This model will serve as an integral component of the loan approval workflow, enabling the lending institution to make informed decisions regarding loan issuance and risk management. The intended outcome is a reduction in loan defaults, leading to increased profitability and improved risk mitigation.

The target performance metric for this model is to achieve at least 80% recall in predicting defaults within the first year after loan issuance. This will help us prioritize avoiding false negatives. Secondary metrics include precision and accuracy.

Key Questions

To achieve the objective, we need to answer the following key questions:

  1. What applicant and loan characteristics are the strongest predictors of loan default? We need to identify the features (e.g., credit score, debt-to-income ratio, loan amount, employment history, etc.) that are most indicative of default risk.
  2. Which supervised machine learning algorithms are best suited for predicting loan defaults given the available data and desired performance metrics? We'll need to compare the performance of different classification algorithms (e.g., Logistic Regression, Random Forest, Decision Trees, etc.) to select the most effective model.
  3. How can we optimize the selected model's hyperparameters to maximize performance in identifying potential defaults while minimizing incorrect classifications? This involves tuning the model to achieve the best balance of recall, precision, and overall accuracy.
  4. How do we handle data imbalances, missing values, and outliers, and how do these treatments affect the model's predictive capability? This concerns data pre-processing and cleaning.
  5. How can the model's predictions be integrated into the loan approval process to effectively manage risk and optimize lending decisions? This concerns implementation.
  6. How can we properly scale the variables? This also concerns data pre-processing.
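For the scaling question, one standard answer is z-score standardization via the StandardScaler this notebook imports; a minimal sketch on a toy frame with two HMEQ-style columns (the real pipeline would fit the scaler on the training split only):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative frame -- in the notebook the real HMEQ columns are used
df = pd.DataFrame({"LOAN": [1100, 16300, 89900], "YOJ": [0.0, 7.0, 41.0]})

scaler = StandardScaler()  # subtract the mean, divide by the standard deviation
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.mean().round(6))  # each scaled column now has mean ~0, unit variance
```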

The Problem Formulation¶

From a data science perspective, this problem is framed as a binary classification task using supervised machine learning. We are trying to predict whether a loan will default (class 1) or not (class 0). We will train a classification model on a labeled dataset of historical loan applications, where each record contains applicant information and a label indicating whether the loan ultimately defaulted. The model will learn the patterns and relationships between the applicant's features and their likelihood of default. Once trained, the model will be used to classify new loan applications, assigning a probability of default, which will inform the lending decision. We will focus on recall as our main metric, and precision and accuracy as secondary metrics.
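The formulation above — fit a classifier on labeled history, then assign each new application a probability of default — can be sketched on synthetic data (the random features below are stand-ins, not the HMEQ pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled historical loan data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                          # applicant features
y = (X[:, 0] + rng.normal(size=500) > 1).astype(int)   # 1 = default, 0 = repaid

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = LogisticRegression().fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of default per applicant
labels = (proba >= 0.5).astype(int)        # decision threshold of 0.5 (tunable)
```

Lowering the 0.5 threshold trades precision for recall, which is how threshold tuning supports the project's recall target.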

Data Description¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant ultimately defaulted or was severely delinquent; this adverse outcome occurred in 1,189 cases (20%). Twelve input variables were recorded for each applicant.

  • BAD: 1 = Client defaulted on loan, 0 = loan repaid

  • LOAN: Amount of loan approved.

  • MORTDUE: Amount due on the existing mortgage.

  • VALUE: Current value of the property.

  • REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).

  • JOB: The type of job the loan applicant has, such as manager, self-employed, etc.

  • YOJ: Years at present job.

  • DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).

  • DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

  • CLAGE: Age of the oldest credit line in months.

  • NINQ: Number of recent credit inquiries.

  • CLNO: Number of existing credit lines.

  • DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).

Data Overview Process¶

  • Loading and reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values

Import the necessary libraries and Data¶

In [ ]:
import numpy as np  # Numerical computing library for array operations, linear algebra, etc.
import pandas as pd  # Data manipulation and analysis library for dataframes
import matplotlib.pyplot as plt  # Data visualization library for creating plots and charts
import seaborn as sns  # Data visualization library based on matplotlib, providing a high-level interface for statistical graphics
sns.set_theme()  # Sets the default theme for seaborn plots to a modern style

# Data splitting
from sklearn.model_selection import train_test_split  # Function for splitting datasets into training and testing sets

# Data scaling
from sklearn.preprocessing import StandardScaler  # Class for standardizing features by removing the mean and scaling to unit variance

# Models
from sklearn.linear_model import LogisticRegression  # Class for creating logistic regression models

# Model Evaluation and Metrics
from sklearn import metrics  # Module for various evaluation metrics and functions
from sklearn.metrics import (
    confusion_matrix,  # Function for creating a confusion matrix to evaluate classification accuracy
    classification_report,  # Function for generating a classification report with precision, recall, F1-score, etc.
    accuracy_score,  # Function for calculating the accuracy of a classification model
    precision_score,  # Function for calculating the precision of a classification model
    recall_score,  # Function for calculating the recall of a classification model
    f1_score,  # Function for calculating the F1-score of a classification model
)

# Tree Models
from sklearn import tree  # Module for decision tree-based models
from sklearn.tree import DecisionTreeClassifier  # Class for creating decision tree classifiers

# Ensemble Models
from sklearn.ensemble import BaggingClassifier  # Class for creating bagging ensemble models
from sklearn.ensemble import RandomForestClassifier  # Class for creating random forest ensemble models

# Statistical Analysis
import scipy.stats as stats  # Module for statistical functions and distributions
from scipy.stats import zscore  # Function for calculating the z-score of data

# Model Tuning
from sklearn.model_selection import GridSearchCV  # Class for performing hyperparameter tuning using grid search
from sklearn.model_selection import cross_val_score
# Warnings Handling
import warnings  # Module for managing warnings
warnings.filterwarnings('ignore')  # Function for ignoring all warnings

Load the dataset¶

In [ ]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Read the dataset¶

In [ ]:
hmeq=pd.read_csv("/content/drive/MyDrive/MIT Applied Data Science Program/Capstone Project/hmeq.csv")
In [ ]:
# Make a copy of the original dataset for preservation
data=hmeq.copy()

Print the first and last 5 rows¶

In [ ]:
# Display the first five rows
data.head()
Out[ ]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN
In [ ]:
# Display the last five rows

data.tail()
Out[ ]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
5955 0 88900 57264.0 90185.0 DebtCon Other 16.0 0.0 0.0 221.808718 0.0 16.0 36.112347
5956 0 89000 54576.0 92937.0 DebtCon Other 16.0 0.0 0.0 208.692070 0.0 15.0 35.859971
5957 0 89200 54045.0 92924.0 DebtCon Other 15.0 0.0 0.0 212.279697 0.0 15.0 35.556590
5958 0 89800 50370.0 91861.0 DebtCon Other 14.0 0.0 0.0 213.892709 0.0 16.0 34.340882
5959 0 89900 48811.0 88934.0 DebtCon Other 15.0 0.0 0.0 219.601002 0.0 16.0 34.571519

Describe the shape of the data¶

In [ ]:
# Complete the code to get the shape of data
print ('This data set contains', data.shape[0], 'rows and' , data.shape[1] , 'columns.')
This data set contains 5960 rows and 13 columns.

Check the data types of the columns of the data¶

In [ ]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
In [ ]:
# checking for duplicate values
print(data.duplicated().sum())
0
In [ ]:
# Check the percentage of missing values in each column.
missing_values = data.isnull().sum()
total_rows = len(data)
percentage_missing = (missing_values / total_rows) * 100

# Create a DataFrame to display the results
missing_data_summary = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage Missing': percentage_missing
})

# Sort by 'Percentage Missing' in descending order
missing_data_summary = missing_data_summary.sort_values(by=['Percentage Missing'], ascending=False)

# Create a row for Total Rows using 'loc' and fill with total_rows
missing_data_summary.loc['Total Rows'] = [total_rows, np.nan]  # np.nan for Percentage Missing

# Reorder the columns
missing_data_summary = missing_data_summary[['Missing Values', 'Percentage Missing']]

# Display the DataFrame
print(missing_data_summary)
            Missing Values  Percentage Missing
DEBTINC             1267.0           21.258389
DEROG                708.0           11.879195
DELINQ               580.0            9.731544
MORTDUE              518.0            8.691275
YOJ                  515.0            8.640940
NINQ                 510.0            8.557047
CLAGE                308.0            5.167785
JOB                  279.0            4.681208
REASON               252.0            4.228188
CLNO                 222.0            3.724832
VALUE                112.0            1.879195
BAD                    0.0            0.000000
LOAN                   0.0            0.000000
Total Rows          5960.0                 NaN

Key Observations:

  • DEBTINC (Debt-to-Income Ratio): This column has the highest number of missing values, with 1,267 missing entries, representing approximately 21.26% of the total data. This is a significant amount of missing data and will need to be addressed carefully.
  • DEROG (Major Derogatory Reports): The second-highest number of missing values (708), accounting for 11.88% of the data.
  • DELINQ, MORTDUE, YOJ, NINQ: These columns have a moderate amount of missing data, ranging from approximately 8.56% to 9.73%.
  • CLAGE, JOB, REASON, CLNO: These have between 3.73% and 5.17% of missing values.
  • VALUE: Has a relatively low percentage of missing values at only 1.88%.
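One common way to address these gaps — median imputation for numeric columns and mode imputation for categoricals — is sketched below on a toy frame. This illustrates the general technique, not necessarily the exact treatment applied later in this project:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the missing-value pattern seen in HMEQ
df = pd.DataFrame({
    "DEBTINC": [33.0, np.nan, 40.0, np.nan],
    "REASON": ["DebtCon", None, "HomeImp", "DebtCon"],
})

df["DEBTINC"] = df["DEBTINC"].fillna(df["DEBTINC"].median())  # numeric: median
df["REASON"] = df["REASON"].fillna(df["REASON"].mode()[0])    # categorical: mode
print(df.isnull().sum().sum())  # 0 missing values remain
```

Medians and modes are robust to the outliers and skew noted elsewhere in this analysis, which is why they are often preferred over means here.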

REASON and JOB are both of type object and need to be converted to categorical.

In [ ]:
# Select the type object columns
cols = data.select_dtypes(['object']).columns.tolist()
# Convert them to categories
for i in cols:
    data[i] = data[i].astype('category')
In [ ]:
cols
Out[ ]:
['REASON', 'JOB']

Summary Statistics¶

In [ ]:
# Get descriptive statistics for the numerical columns

data.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
BAD 5960.0 0.199497 0.399656 0.000000 0.000000 0.000000 0.000000 1.000000
LOAN 5960.0 18607.969799 11207.480417 1100.000000 11100.000000 16300.000000 23300.000000 89900.000000
MORTDUE 5442.0 73760.817200 44457.609458 2063.000000 46276.000000 65019.000000 91488.000000 399550.000000
VALUE 5848.0 101776.048741 57385.775334 8000.000000 66075.500000 89235.500000 119824.250000 855909.000000
YOJ 5445.0 8.922268 7.573982 0.000000 3.000000 7.000000 13.000000 41.000000
DEROG 5252.0 0.254570 0.846047 0.000000 0.000000 0.000000 0.000000 10.000000
DELINQ 5380.0 0.449442 1.127266 0.000000 0.000000 0.000000 0.000000 15.000000
CLAGE 5652.0 179.766275 85.810092 0.000000 115.116702 173.466667 231.562278 1168.233561
NINQ 5450.0 1.186055 1.728675 0.000000 0.000000 1.000000 2.000000 17.000000
CLNO 5738.0 21.296096 10.138933 0.000000 15.000000 20.000000 26.000000 71.000000
DEBTINC 4693.0 33.779915 8.601746 0.524499 29.140031 34.818262 39.003141 203.312149
In [ ]:
# Get descriptive statistics for the categorical columns
data.describe(include=['category']).T
Out[ ]:
count unique top freq
REASON 5708 2 DebtCon 3928
JOB 5681 6 Other 2388
In [ ]:
# Making a list of all categorical variables
cat_col = list(data.select_dtypes("category").columns) # Changed "object" to "category"

# Printing number of count of each unique value in each column
for col in cat_col:
    print(f"Value counts and percentages for column '{col}':\n")

    # Calculate value counts
    counts = data[col].value_counts()

    # Calculate percentages
    percentages = data[col].value_counts(normalize=True) * 100

    # Combine counts and percentages into a DataFrame for better display
    result_df = pd.DataFrame({'Count': counts, 'Percentage': percentages})

    print(result_df)
    print("\n")
Value counts and percentages for column 'REASON':

         Count  Percentage
REASON                    
DebtCon   3928   68.815697
HomeImp   1780   31.184303


Value counts and percentages for column 'JOB':

         Count  Percentage
JOB                       
Other     2388   42.034853
ProfExe   1276   22.460834
Office     948   16.687203
Mgr        767   13.501144
Self       193    3.397289
Sales      109    1.918676


Key Observations:

Loan Details:

  • LOAN (Loan Amount): Loan amounts show considerable variation, with an average of around 18,608. Most loan amounts fall between 11,100 and 23,300, but the range extends from a minimum of 1,100 to a maximum of 89,900.

  • MORTDUE (Mortgage Due): On average, the mortgage amount due is about 73,761, but there's substantial variability. The median mortgage due is 65,019.

Property and Job:

  • VALUE (Property Value): The average property value is approximately 101,776, with a wide range of values.
  • YOJ (Years at Job): On average, applicants have worked at their current job for nearly 9 years. Some applicants have 0 years of tenure, while others have over 40 years.

Credit History:

  • DEROG (Derogatory Reports): The majority of applicants have no major derogatory reports. However, some applicants have up to 10 derogatory reports.
  • DELINQ (Delinquent Credit Lines): Most applicants have no delinquent credit lines, but some have up to 15.
  • CLAGE (Age of Credit Line): The average age of the oldest credit line is about 180 months (15 years), but it varies from very new credit lines to over 97 years old.
  • NINQ (Recent Credit Inquiries): Applicants average around 1 recent credit inquiry, though some have up to 17.
  • CLNO (Credit Lines): Applicants have, on average, approximately 21 existing credit lines, but the number varies significantly.

Debt:

  • DEBTINC (Debt-to-Income): The average debt-to-income ratio is around 33.8, ranging from very low to very high. The median is 34.82.

General Observations:

  • The dataset contains integer, float, and object data types, with no duplicate rows.
  • Missing Data: Several variables have missing values (MORTDUE, VALUE, YOJ, DEROG, DELINQ, CLAGE, NINQ, CLNO, DEBTINC), which require attention during data cleaning.
  • Variability: Loan amounts, property values, mortgage due amounts, and credit-related factors all show significant variability.
  • Most Applicants: Most applicants have no delinquencies or derogatory reports.
  • Outliers: Several of the variables contain outliers.
  • Loan Defaults: 20% of loans default, making the target classes imbalanced.
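The 20% default figure comes directly from the target column; on a stand-in series with the dataset's 1,189 defaults out of 5,960 loans:

```python
import pandas as pd

# Toy stand-in: in the notebook this is data["BAD"] itself
bad = pd.Series([1] * 1189 + [0] * (5960 - 1189))

rate = bad.mean()  # mean of a 0/1 column is the proportion of defaults
print(f"Default rate: {rate:.1%}")
```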

Exploratory Data Analysis (EDA) and Visualization¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

Univariate Analysis¶

In [ ]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="cyan"
    )  # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
        )  # histogram with the requested number of bins
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # default bins
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
histogram_boxplot(data, "LOAN")
In [ ]:
histogram_boxplot(data,'MORTDUE')
In [ ]:
histogram_boxplot(data, 'VALUE')
In [ ]:
histogram_boxplot(data, 'YOJ')
In [ ]:
histogram_boxplot(data,'DEROG')
In [ ]:
histogram_boxplot(data,'DELINQ')
In [ ]:
histogram_boxplot(data,'CLAGE')
In [ ]:
histogram_boxplot(data,'NINQ')
In [ ]:
histogram_boxplot(data,'CLNO')
In [ ]:
histogram_boxplot(data,'DEBTINC')
In [ ]:
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
labeled_barplot(data, "BAD", perc=True)
In [ ]:
labeled_barplot(data, "REASON", perc=True)
In [ ]:
labeled_barplot(data, "JOB", perc=True)

Bivariate Analysis¶

In [ ]:
# Create a copy of the original data frame and only include the numeric columns.
data_numeric = data.select_dtypes(include=np.number).copy()

# Now, correctly add the 'BAD' column to data_numeric and change its type to int
data_numeric['BAD'] = data['BAD'].astype(int)
# %%
cols_list = data_numeric.columns.tolist() #Use the new numeric dataframe
In [ ]:
plt.figure(figsize=(12, 7))
sns.heatmap(
    data_numeric[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
) # Use the new numeric dataframe
plt.show()

Multivariate Analysis¶

In [ ]:
sns.boxplot(x=data["BAD"],y=data['LOAN'],palette="PuBu")
Out[ ]:
<Axes: xlabel='BAD', ylabel='LOAN'>
In [ ]:
sns.boxplot(x=data["BAD"],y=data['MORTDUE'],palette="PuBu")
Out[ ]:
<Axes: xlabel='BAD', ylabel='MORTDUE'>
In [ ]:
# Create a color-coded scatter plot
sns.scatterplot(x='VALUE', y='MORTDUE', hue='BAD', data=data.dropna(subset=['VALUE', 'MORTDUE', 'BAD']), palette={0: 'blue', 1: 'orange'})

plt.xlabel("Property Value")
plt.ylabel("Mortgage Due")
plt.show()
In [ ]:
# Create the pair plot
sns.pairplot(data_numeric)

plt.show()
In [ ]:
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [ ]:
stacked_barplot(data, 'JOB', 'BAD')
BAD         0     1   All
JOB                      
All      4515  1166  5681
Other    1834   554  2388
ProfExe  1064   212  1276
Mgr       588   179   767
Office    823   125   948
Self      135    58   193
Sales      71    38   109
------------------------------------------------------------------------------------------------------------------------
In [ ]:
# Create a grouped box plot
sns.boxplot(x='REASON', y='DEBTINC', hue='BAD', data=data)
plt.show()

EDA Key Observations:¶

Univariate Analysis:¶

  • Class Imbalance: The target BAD is imbalanced, with 80% of loans repaid and 20% defaulting. This imbalance must be accounted for during modeling to prevent bias toward the majority (non-default) class.

  • Right-Skewness: Many of the numerical variables are right-skewed, indicating that extreme values in the higher range are less frequent but present.

  • Outliers: Most numerical variables have outliers, which may need to be addressed during data preprocessing, depending on the chosen handling method.

  • Categorical Variables: DebtCon is the dominant category in REASON, and Other is the dominant category in JOB.

  • LOAN: Loan amount is right-skewed, with a few much larger loans; models sensitive to skewness may need a transformation.

  • MORTDUE: Similar to loan amount, mortgage due is also right-skewed, suggesting potential need for transformation.

  • VALUE: Property value exhibits a right-skewed distribution with some extreme values. Outlier handling might be necessary.

  • YOJ: Years at present job distribution may reveal common career progression stages.

  • DEROG, DELINQ, NINQ: These credit history and inquiry variables are highly skewed, indicating a few applicants with considerable negative marks.

  • CLAGE: The age of oldest credit line is right-skewed, implying a mix of credit histories.

  • CLNO: Number of credit lines is right-skewed, indicating a few applicants with a larger number of credit lines.

  • DEBTINC: Debt-to-income ratio shows a wide range and skewed distribution, potentially impacting model performance. Transformation or careful treatment may be needed.
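The right-skewness noted in several of these bullets can be quantified rather than eyeballed; a minimal sketch using pandas' `skew` on a toy column (in the notebook the equivalent call would be `data.skew(numeric_only=True)` on the HMEQ numeric columns):

```python
import pandas as pd

# Toy stand-in for a right-skewed HMEQ column such as LOAN
sample = pd.DataFrame({"LOAN": [5000, 8000, 9000, 10000, 11000, 90000]})

# Positive skew confirms a long right tail; values well above 1 are a hint
# that capping or a log transform may help skew-sensitive models
skew_values = sample.skew(numeric_only=True)
print(skew_values["LOAN"])
```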

Bivariate Analysis:¶

Key Correlations with Loan Default (BAD = 1):

  • DEBTINC (Debt-to-Income Ratio): DEBTINC has the strongest positive correlation with BAD at 0.24. As the debt-to-income ratio increases, the likelihood of loan default also increases significantly. This is a crucial indicator of risk.

  • DEROG (Major Derogatory Reports): There is a positive correlation of 0.18 between BAD and DEROG. More derogatory reports are associated with a higher probability of loan default.

  • DELINQ (Delinquent Credit Lines): BAD and DELINQ have a positive correlation of 0.15 indicating a greater number of delinquent credit lines increases the likelihood of default.

  • VALUE (Property Value): There is a notable negative correlation of -0.19 between BAD and VALUE. Higher property values are associated with a lower probability of default. This suggests that borrowers with more valuable properties are a lower risk.

Other Important Variable Correlations:

  • LOAN and MORTDUE: There is a moderate positive correlation of 0.65 between LOAN and MORTDUE suggesting that larger loans are often associated with higher outstanding mortgages.

  • MORTDUE and VALUE: There is a very strong positive correlation of 0.83 between MORTDUE and VALUE. Higher property values tend to be associated with higher mortgages.

  • VALUE and LOAN: There is a moderate positive correlation of 0.57 between VALUE and LOAN indicating higher value properties are associated with larger loans.

  • DEROG and DELINQ: There is a moderate positive correlation of 0.48 between DEROG and DELINQ indicating applicants with major derogatory reports also tend to have more delinquent credit lines.

  • CLNO and CLAGE: There is a moderate positive correlation of 0.36 between CLNO and CLAGE indicating applicants with older credit lines tend to have more credit lines.

  • NINQ and DELINQ: There is a moderate positive correlation of 0.30 between NINQ and DELINQ indicating applicants with more recent credit inquiries tend to have more delinquent credit lines.

  • YOJ and CLAGE: There is a moderate positive correlation of 0.20 between YOJ and CLAGE indicating applicants who have been at their job longer tend to have older credit lines.
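The correlations quoted above come straight from the correlation matrix; a minimal sketch on a toy frame (column names follow HMEQ, but the numbers are illustrative, not the real data):

```python
import pandas as pd

# Toy frame with HMEQ-style columns; values are illustrative only
toy = pd.DataFrame({
    "BAD":     [1, 0, 0, 1, 0, 1, 0, 0],
    "DEBTINC": [45, 20, 25, 50, 22, 48, 30, 28],
    "VALUE":   [60, 120, 110, 55, 130, 70, 100, 115],
})

# Pearson correlation of every numeric column with the target
corr_with_bad = toy.corr()["BAD"].drop("BAD").sort_values(ascending=False)
print(corr_with_bad)  # DEBTINC positive, VALUE negative, as in the EDA
```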

Multivariate Analysis:¶

  • There's a positive correlation between Property Value and Mortgage Due. As property value increases, the mortgage due also tends to increase.
  • People who work in Sales have the highest default rate, whereas people who work in an Office have the lowest.
  • As Property Value and Mortgage Due increase, the proportion of loans that default seems to decrease. This suggests that larger loans on more expensive properties are less likely to default.
  • DEBTINC appears to be a useful variable for distinguishing loans that defaulted from loans that were paid off, especially for home improvement loans.
  • Higher DEBTINC is strongly associated with defaulted loans in both categories.
  • The reason for the loan also plays a role. Home improvement loans, in general, exhibit higher debt-to-income ratios and a greater disparity between defaulted and paid off loans compared to debt consolidation loans.
  • The presence of outliers, particularly in the defaulted loan categories, warrants further investigation. These extreme values might represent specific cases with unique circumstances or could indicate data entry errors.

Treating Outliers¶

To handle outliers, we cap values using the IQR method: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is clipped to that threshold.

In [ ]:
# Functions to treat outliers
def treat_outliers(df, col):
    """Treats outliers in a variable using the IQR method."""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    Lower_Whisker = Q1 - 1.5 * IQR
    Upper_Whisker = Q3 + 1.5 * IQR
    df[col] = np.clip(df[col], Lower_Whisker, Upper_Whisker)
    return df

def treat_outliers_all(df, col_list):
    """Treats outliers in all specified numerical columns."""
    for c in col_list:
        df = treat_outliers(df, c)
    return df
In [ ]:
# Create a copy of the original dataframe so as not to overwrite it
df_orig = data.copy()

# Get the list of numerical columns, excluding the target BAD
numerical_col = df_orig.select_dtypes(include=np.number).columns.tolist()
numerical_col.remove("BAD")

# Apply the caps to the outliers
df = treat_outliers_all(df=df_orig, col_list=numerical_col)

Treating Missing Values¶

Missing values in numerical columns are filled with the median, while missing values in categorical columns are filled with the mode.

In [ ]:
# Function to create missing value flags
def add_binary_flag(df, col):
    """Adds a binary flag column for missing values."""
    new_col = col + '_missing_values_flag'
    df[new_col] = df[col].isna().astype(int)
    return df
In [ ]:
# Add missing value flags
miss_col = [col for col in df.columns if df[col].isnull().any()]
for c in miss_col:
    df = add_binary_flag(df=df, col=c)  # Call the function to add flags
In [ ]:
# Treat missing data
# Select numeric columns.
n_data = df.select_dtypes('number')

# Select category columns.
c_data = df.select_dtypes('category').columns.tolist()

# Fill numeric columns with median.
df[n_data.columns] = n_data.fillna(n_data.median())

# Fill categorical columns with mode.
for column in c_data:
    mode = df[column].mode()[0]
    df[column] = df[column].fillna(mode)
In [ ]:
# Select only the numeric columns
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# dropping BAD as it is the target variable
numeric_columns.remove("BAD")
# Determine the number of rows and columns for subplots
num_cols = len(numeric_columns)
num_rows = (num_cols + 5) // 6  # Calculate rows needed for 6 plots per row

# Create a figure with the correct size
plt.figure(figsize=(15, num_rows * 4))  # Adjust height dynamically

# Iterate through each numeric column and create a box plot
for i, variable in enumerate(numeric_columns):
    plt.subplot(num_rows, 6, i + 1)  # Adjust subplot layout dynamically
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Key Observations:¶

Outliers Detected: Several variables, such as MORTDUE, VALUE, YOJ, CLAGE, and DEBTINC, show clear outliers beyond the upper whisker.

Skewed Distributions: Some variables, like LOAN, MORTDUE, VALUE, and CLAGE, appear to have a right-skewed distribution, with a long upper tail.

Binary Flags: The *_missing_values_flag columns, such as MORTDUE_missing_values_flag and VALUE_missing_values_flag, are binary (0 or 1), which is why they show minimal variation.

In [ ]:
# Check the percentage of missing values in each column.
missing_values = df.isnull().sum()
total_rows = len(df)
percentage_missing = (missing_values / total_rows) * 100

# Create a DataFrame to display the results
missing_data_summary = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage Missing': percentage_missing
})

# Sort by 'Percentage Missing' in descending order
missing_data_summary = missing_data_summary.sort_values(by=['Percentage Missing'], ascending=False)

# Create a row for Total Rows using 'loc' and fill with total_rows
missing_data_summary.loc['Total Rows'] = [total_rows, np.nan]  # np.nan for Percentage Missing

# Reorder the columns
missing_data_summary = missing_data_summary[['Missing Values', 'Percentage Missing']]

# Display the DataFrame
print(missing_data_summary)
                             Missing Values  Percentage Missing
BAD                                     0.0                 0.0
LOAN                                    0.0                 0.0
CLNO_missing_values_flag                0.0                 0.0
NINQ_missing_values_flag                0.0                 0.0
CLAGE_missing_values_flag               0.0                 0.0
DELINQ_missing_values_flag              0.0                 0.0
DEROG_missing_values_flag               0.0                 0.0
YOJ_missing_values_flag                 0.0                 0.0
JOB_missing_values_flag                 0.0                 0.0
REASON_missing_values_flag              0.0                 0.0
VALUE_missing_values_flag               0.0                 0.0
MORTDUE_missing_values_flag             0.0                 0.0
DEBTINC                                 0.0                 0.0
CLNO                                    0.0                 0.0
NINQ                                    0.0                 0.0
CLAGE                                   0.0                 0.0
DELINQ                                  0.0                 0.0
DEROG                                   0.0                 0.0
YOJ                                     0.0                 0.0
JOB                                     0.0                 0.0
REASON                                  0.0                 0.0
VALUE                                   0.0                 0.0
MORTDUE                                 0.0                 0.0
DEBTINC_missing_values_flag             0.0                 0.0
Total Rows                           5960.0                 NaN

All missing values have been accounted for.

In [ ]:
# Select columns that have "_missing_values_flag" in their name.
missing_flag_cols = [col for col in df.columns if "_missing_values_flag" in col]

# Create a new DataFrame with only the missing value flag columns
missing_flag_df = df[missing_flag_cols]

# Calculate the correlation matrix
corr = missing_flag_df.corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))  # Adjust the size as needed
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt=".2f", vmin=-1, vmax=1)
plt.title("Heatmap of Missing Value Flag Correlations")
plt.show()

Key Observations:

  • Co-occurring Missingness: The heatmap shows that certain variables tend to be missing together. For example, a strong positive correlation between MORTDUE_missing_values_flag and VALUE_missing_values_flag indicates that when mortgage-due information is missing, property-value information is frequently missing as well, and vice versa. This points to a non-random pattern of missing data.

  • Credit history missingness: The heatmap shows that there is a positive correlation between the credit variables DEROG_missing_values_flag and DELINQ_missing_values_flag. This means that if one of the variables is missing, the other is also likely to be missing. This is likely due to these variables being related to credit history.

Important Insights from EDA¶

What are the most important observations and insights from the data based on the EDA performed?

Loan Default Patterns:

  • Class Imbalance: 20% of loans defaulted, 80% were repaid. Model adjustments needed to prevent bias.

Numerical Variable Characteristics:

  • Skewness & Outliers: LOAN, MORTDUE, VALUE, YOJ, and CLAGE are right-skewed with outliers.
  • Variability: Significant differences in financial profiles among applicants.

Categorical Variable Characteristics:

  • Loan Purpose: Debt consolidation was the most common reason.
  • Job & Default: Sales professionals had the highest default rate, while office workers had the lowest.

Missing Data:

  • Key missing variables: DEBTINC (21.26%), DEROG, DELINQ, MORTDUE, YOJ, VALUE.
  • Patterns: Missing values in DEROG & DELINQ were often linked. Missing values were found to be indicators of default risk.

Key Variable Relationships:

  • Higher Risk of Default:
    • High DEBTINC (Debt-to-income ratio).
    • More DEROG (derogatory credit reports) and DELINQ (delinquent credit lines).
  • Lower Risk of Default:
    • Higher VALUE (property value).
    • Longer CLAGE (age of oldest credit line).
  • Notable Correlations:
    • MORTDUE & VALUE: Higher property values correlate with higher mortgage amounts.
    • LOAN & MORTDUE: Larger loans tend to be associated with larger mortgages.
    • DELINQ & DEROG: Applicants with derogatory reports often have delinquencies.
    • Loan & Default: Defaulters had larger loan and mortgage amounts.
    • Property Value & Default: Defaulters had lower property values for similar mortgage amounts.
    • DEBTINC & Loan Purpose: Key factor in distinguishing risk, especially for home improvement loans.
    • Applicant Profile: Most applicants had no delinquencies or derogatory reports.

Overall Implications:

  • Key Risk Factors: DEBTINC, DEROG, DELINQ, VALUE, YOJ, CLAGE are strong predictors.
  • Data Quality: Missing data and outliers required careful handling.
  • Model Considerations: Class imbalance, outliers, and variable relationships impact model selection and feature engineering.
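One standard response to the class imbalance listed under Model Considerations is class weighting, which the tuned models below rely on. A minimal sketch with scikit-learn's `compute_class_weight` on illustrative labels mirroring the 80/20 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative labels mirroring the ~80/20 split of BAD
y = np.array([0] * 80 + [1] * 20)

# "balanced" weights each class inversely to its frequency:
# weight_c = n_samples / (n_classes * count_c)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights)))  # the minority class gets the larger weight
```

These weights (or a manual dictionary such as `{0: 0.3, 1: 0.7}`) can be passed to any estimator that accepts a `class_weight` parameter.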

Model Building - Approach¶

  • Data preparation
  • Partition the data into train and test set
  • Build the model
  • Fit on the train data
  • Tune the model
  • Test the model on test set

Partition the data into train and test set¶

In [ ]:
# Separate X and Y
X = df.drop(["BAD"], axis=1)
Y = df["BAD"]

# One-Hot Encoding for the categorical variables
X = pd.get_dummies(X, drop_first=True)

# Convert missing value flags back to int (after get_dummies)
miss_col = [col for col in X.columns if 'missing_values_flag' in col]
X[miss_col] = X[miss_col].astype(int)

# Split the data, stratifying on BAD to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, random_state=1, stratify=Y
)
In [ ]:
# Print the Results
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (4172, 27)
Shape of test set :  (1788, 27)
Percentage of classes in training set:
BAD
0    0.800575
1    0.199425
Name: proportion, dtype: float64
Percentage of classes in test set:
BAD
0    0.800336
1    0.199664
Name: proportion, dtype: float64
In [ ]:
#creating metric function
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)

    plt.figure(figsize = (8, 5))

    sns.heatmap(cm, annot = True, fmt = '.0f', xticklabels = ['Not Defaulted', 'Defaulted'], yticklabels = ['Not Defaulted', 'Defaulted'])

    plt.ylabel('Actual')

    plt.xlabel('Predicted')

    plt.show()

Logistic Regression¶

In [ ]:
# 1. Create a StandardScaler object
scaler = StandardScaler()

# 2. Fit the scaler on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)

# 4. Define the Logistic regression model (using scaled data)
model = LogisticRegression()

# 5. Fit the model on the scaled training data
model.fit(X_train_scaled, y_train)
Out[ ]:
LogisticRegression()

Checking the performance on the LR train dataset¶

In [ ]:
#Predict for train set
y_train_pred = model.predict(X_train_scaled)

#checking the performance on the train dataset
metrics_score(y_train, y_train_pred)
              precision    recall  f1-score   support

           0       0.91      0.94      0.92      3340
           1       0.71      0.61      0.65       832

    accuracy                           0.87      4172
   macro avg       0.81      0.77      0.79      4172
weighted avg       0.87      0.87      0.87      4172


Checking the performance on the LR test dataset¶

In [ ]:
#Predict for test set
y_test_pred = model.predict(X_test_scaled)

#checking the performance on the test dataset
metrics_score(y_test, y_test_pred)
              precision    recall  f1-score   support

           0       0.90      0.95      0.93      1431
           1       0.75      0.59      0.66       357

    accuracy                           0.88      1788
   macro avg       0.82      0.77      0.79      1788
weighted avg       0.87      0.88      0.87      1788


Observations¶

The model is good at detecting non-defaulters (95% recall for class 0) but weak at detecting defaulters: test recall for class 1 is only 59%. Since false negatives (missed defaulters) are the costliest error here, improving recall for defaulters is the priority.
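One lever for raising defaulter recall without retraining is lowering the 0.5 decision threshold that `predict` applies to the predicted probabilities. A self-contained sketch on synthetic data (in the notebook, the same idea would use `model`, `X_test_scaled`, and `y_test`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Tiny synthetic stand-in for the scaled train/test data
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=1.5, size=200) > 1).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# Lowering the threshold flags more applicants as likely defaulters,
# trading precision for recall (fewer false negatives)
for threshold in (0.5, 0.3):
    preds = (proba >= threshold).astype(int)
    print(threshold, round(recall_score(y, preds, pos_label=1), 2))
```

Recall for the positive class is monotone non-decreasing as the threshold falls; the cost is more false positives, so the operating point should be chosen from business costs (for example, off a precision-recall curve).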

In [ ]:
# Get the coefficients and feature names
coefficients = model.coef_[0]  # Get the array of coefficients
feature_names = X_train.columns  # Get the feature names

# Create a DataFrame to hold the coefficients and feature names
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})
print(coef_df)
print(model.intercept_)
                        Feature  Coefficient
0                          LOAN    -0.073229
1                       MORTDUE    -0.191291
2                         VALUE     0.100110
3                           YOJ    -0.159877
4                         DEROG     0.000000
5                        DELINQ     0.000000
6                         CLAGE    -0.462167
7                          NINQ     0.206908
8                          CLNO     0.067103
9                       DEBTINC     0.688868
10  MORTDUE_missing_values_flag     0.154270
11    VALUE_missing_values_flag     0.725203
12   REASON_missing_values_flag     0.011399
13      JOB_missing_values_flag    -0.442760
14      YOJ_missing_values_flag    -0.234999
15    DEROG_missing_values_flag    -0.291321
16   DELINQ_missing_values_flag    -0.405556
17    CLAGE_missing_values_flag     0.249286
18     NINQ_missing_values_flag     0.031493
19     CLNO_missing_values_flag     0.322417
20  DEBTINC_missing_values_flag     1.161286
21               REASON_HomeImp     0.114141
22                   JOB_Office    -0.196002
23                    JOB_Other    -0.019564
24                  JOB_ProfExe    -0.144800
25                    JOB_Sales     0.114879
26                     JOB_Self     0.059503
[-2.13764081]
In [ ]:
# Get the coefficients and feature names
coefficients = model.coef_[0]  # Get the array of coefficients
feature_names = X_train.columns  # Get the feature names from the original (unscaled) data

# Create a DataFrame to hold the coefficients and feature names
coef_df = pd.DataFrame({'Feature': feature_names, 'Coefficient': coefficients})

# Sort the DataFrame by the absolute value of the coefficients
coef_df['Abs_Coefficient'] = abs(coef_df['Coefficient'])
coef_df = coef_df.sort_values('Abs_Coefficient', ascending=False)

# Plot the coefficients
plt.figure(figsize=(10, 6))
plt.bar(coef_df['Feature'], coef_df['Coefficient'])
plt.xlabel('Feature')
plt.ylabel('Coefficient Value')
plt.title('Logistic Regression Coefficients')
plt.xticks(rotation=90)  # Rotate x-axis labels for readability
plt.tight_layout()
plt.show()

Feature Importance:

  • Missing-value flags such as DEBTINC_missing_values_flag and VALUE_missing_values_flag have large positive coefficients, suggesting that missingness itself is a red flag.
  • Among the original features, DEBTINC (debt-to-income ratio) has the largest positive coefficient, while CLAGE (age of oldest credit line) has the strongest negative one; DEROG and DELINQ received zero coefficients in this fit.

Decision Tree¶

For the decision tree, we will continue to use the same cleansed data set as the logistic regression model for a true comparison.

In [ ]:
# Fitting the decision tree classifier on the training data
d_tree = DecisionTreeClassifier(random_state=1)  # Initialize the DecisionTreeClassifier

d_tree.fit(X_train, y_train)  # Fit the model to the training data
Out[ ]:
DecisionTreeClassifier(random_state=1)

Check performance on the default decision tree training dataset¶

In [ ]:
y_pred_train = d_tree.predict(X_train)
metrics_score(y_train,y_pred_train)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3340
           1       1.00      1.00      1.00       832

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172


Check performance on the default decision tree test dataset¶

In [ ]:
y_pred_test = d_tree.predict(X_test)
metrics_score(y_test,y_pred_test)
              precision    recall  f1-score   support

           0       0.90      0.93      0.91      1431
           1       0.67      0.59      0.63       357

    accuracy                           0.86      1788
   macro avg       0.78      0.76      0.77      1788
weighted avg       0.85      0.86      0.86      1788


Observations¶

The performance on the test data is significantly worse than on the training data, indicating overfitting: the default tree memorized the training set perfectly. The next step is to tune the decision tree using hyperparameter tuning.

Decision Tree - Hyperparameter Tuning¶

  • Hyperparameter tuning is largely empirical: there is no direct way to calculate how a change in a hyperparameter value will reduce the model's loss, so we resort to experimentation. We'll use grid search to perform the tuning.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search that is performed on the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

Criterion {“gini”, “entropy”}

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at the link below and try tuning them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
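As a hand-worked illustration of the two split criteria described above, Gini impurity and entropy can be computed for a node directly (a sketch of the formulas, not scikit-learn internals):

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_c^2) over class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: -sum(p_c * log2(p_c))."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

# A node mirroring the dataset's 80/20 split of BAD
print(gini([80, 20]))     # about 0.32
print(entropy([80, 20]))  # about 0.722 bits
```

Both measures are zero for a pure node and maximal at a 50/50 split; in practice they tend to pick similar splits, which is why both appear in the parameter grid.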

In [ ]:
# Choose the type of classifier
d_tree_tuned = DecisionTreeClassifier(random_state = 1, class_weight = {0: 0.3, 1: 0.7})

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(3, 10),
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [1, 5, 10, 20, 25],
              'min_samples_split': [2, 5, 10],
              'max_features': ['sqrt', 'log2'],

             }

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
                       max_depth=9, max_features='sqrt', min_samples_leaf=5,
                       random_state=1)

Checking performance on the tuned decision tree training dataset¶

In [ ]:
# Checking performance on the training data
y_pred_train2 = d_tree_tuned.predict(X_train)  # Predict on the training data using the *tuned* model
metrics_score(y_train, y_pred_train2)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       0.90      0.92      0.91      3340
           1       0.64      0.59      0.62       832

    accuracy                           0.85      4172
   macro avg       0.77      0.75      0.76      4172
weighted avg       0.85      0.85      0.85      4172


Checking performance on the tuned decision tree test dataset¶

In [ ]:
# Checking performance on the testing data
y_pred_test2 = d_tree_tuned.predict(X_test)  # Use the *tuned* model: d_tree_tuned

metrics_score(y_test, y_pred_test2)  # Evaluate using your 'metrics_score' function
              precision    recall  f1-score   support

           0       0.87      0.91      0.89      1431
           1       0.56      0.48      0.51       357

    accuracy                           0.82      1788
   macro avg       0.72      0.69      0.70      1788
weighted avg       0.81      0.82      0.81      1788


Observations¶

The tuned decision tree still performs poorly at identifying true loan defaulters. Although the gap between training and test performance has narrowed, only 48% of actual defaulters are identified on the test set, which is worse than the untuned tree's 59% test recall.

Plot the Tuned Decision Tree

In [ ]:
features = list(X.columns)

plt.figure(figsize = (20, 10))

tree.plot_tree(d_tree_tuned, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)

plt.show()

Observations:

  • CLNO is the root node of the decision tree, i.e., the first variable the tree splits on.
  • Blue leaves represent defaulters, y[1], while orange leaves represent non-defaulters, y[0]; the more observations in a leaf, the darker its color.
  • This decision tree is too large to read easily and should be pruned.
In [ ]:
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_

indices = np.argsort(importances)

plt.figure(figsize = (10, 10))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [features[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()

Feature Importance:

  • DEBTINC and DEBTINC_missing_values_flag have the highest importance, followed by VALUE_missing_values_flag and CLAGE.

Prune the Decision Tree¶

Pruning the decision tree to see if we can get better results at detecting defaulters.

In [ ]:
# Fit a full decision tree (no pruning) to calculate ccp_alphas
tree_clf = DecisionTreeClassifier(random_state=1)
tree_clf.fit(X_train, y_train)

# Find the effective alpha values for pruning
path = tree_clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Plotting the impurities against the alphas
plt.figure(figsize=(8, 6))
plt.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
plt.xlabel("effective alpha")
plt.ylabel("total impurity of leaves")
plt.title("Total Impurity vs effective alpha for training set")
plt.show()

# Create an array to hold the trees
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)  # Fit the classifier
    clfs.append(clf)

# Removing the last value because it just creates an empty tree
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

# Create an array to hold the accuracy results for each tree
train_scores = [
    cross_val_score(clf, X_train, y_train, cv=5).mean() for clf in clfs
]
test_scores = [
    cross_val_score(clf, X_test, y_test, cv=5).mean() for clf in clfs
]

# Plotting the Accuracy vs Alphas
plt.figure(figsize=(8, 6))
plt.plot(ccp_alphas, test_scores, marker="o", label="Test Accuracy")
plt.plot(ccp_alphas, train_scores, marker="o", label="Train Accuracy")
plt.xlabel("alpha")
plt.ylabel("Accuracy")
plt.title("Accuracy vs alpha")
plt.legend()
plt.show()

Observations:

  • The impurity vs. alpha graph suggests that pruning is needed to avoid overfitting.

  • The decision tree is overfitting at low alpha. Including ccp_alpha in the hyperparameter tuning will help find the right level of pruning.

In [ ]:
# Finding the best alpha and best tree
best_index = np.argmax(test_scores)
best_alpha = ccp_alphas[best_index]
best_tree = clfs[best_index]
In [ ]:
# Choose the type of classifier with class weight adjustments
# and the best alpha
d_tree_tuned_with_pruning = DecisionTreeClassifier(
    random_state=1, ccp_alpha=best_alpha
)

# Grid of parameters to choose from
parameters = {
    'max_depth': np.arange(3, 10),
    'criterion': ['gini', 'entropy'],
    'min_samples_leaf': [1, 5, 10, 20],
    'min_samples_split': [2, 5],
    'max_features': ['sqrt', 'log2'],
    "class_weight": [
        None,
        "balanced",
        {0: 0.3, 1: 0.7},
    ],  # Different class weights
}

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label=1)

# Run the grid search
grid_obj_pruned = GridSearchCV(d_tree_tuned_with_pruning, parameters, scoring=scorer, cv=5, n_jobs = -1)
grid_obj_pruned.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
d_tree_tuned_with_pruning = grid_obj_pruned.best_estimator_

# Fit the best algorithm to the data
d_tree_tuned_with_pruning.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.004106586322522338, class_weight='balanced',
                       criterion='entropy', max_depth=4, max_features='sqrt',
                       min_samples_leaf=10, random_state=1)

Check the performance of the tuned decision tree with pruning on the training dataset¶

In [ ]:
# Checking performance on the training data
y_pred_train3 = d_tree_tuned_with_pruning.predict(X_train)  # Predict on the training data using the *tuned* model
metrics_score(y_train, y_pred_train3)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       0.92      0.86      0.89      3340
           1       0.56      0.71      0.63       832

    accuracy                           0.83      4172
   macro avg       0.74      0.79      0.76      4172
weighted avg       0.85      0.83      0.84      4172


Check the performance of the tuned decision tree with pruning on the test dataset¶

In [ ]:
# Checking performance on the testing data
y_pred_test3 = d_tree_tuned_with_pruning.predict(X_test)  # Use the *tuned* model

metrics_score(y_test, y_pred_test3)  # Evaluate using your 'metrics_score' function
              precision    recall  f1-score   support

           0       0.92      0.87      0.89      1431
           1       0.57      0.69      0.62       357

    accuracy                           0.83      1788
   macro avg       0.74      0.78      0.76      1788
weighted avg       0.85      0.83      0.84      1788

In [ ]:
features = list(X.columns)

plt.figure(figsize = (20, 10))

tree.plot_tree(d_tree_tuned_with_pruning, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = True)

plt.show()

Observations:

  • CLNO is the root node of the decision tree.
  • Blue leaves represent defaulters (y[1]), while orange leaves represent non-defaulters (y[0]). The darker a node's color, the purer its class composition.
  • This decision tree has far fewer nodes than the earlier tuned decision tree and is easier to interpret.

Observations¶

The tuned decision tree with pruning generalizes well: its test performance is close to its training performance. Recall on the test data is now 69%, up from 48% with the previously tuned model. Pruning also improved accuracy on the test set and reduced overfitting.

In [ ]:
# Plotting the feature importance
importances = d_tree_tuned_with_pruning.feature_importances_

indices = np.argsort(importances)

plt.figure(figsize = (10, 10))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [features[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()

Feature Importance:

  • DEBTINC, VALUE_values_missing_flag, and CLNO were the only features with notable importance.

Building a Random Forest Classifier¶

Random Forest is a bagging algorithm where the base models are Decision Trees. Bootstrap samples are drawn from the training data, and a decision tree is trained on each sample.

The results from all the decision trees are combined together and the final prediction is made using voting or averaging.
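To make the idea concrete, here is a minimal from-scratch sketch of bagging with majority voting on synthetic data (illustrative only; the project itself uses scikit-learn's RandomForestClassifier):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
rng = np.random.default_rng(1)

# Train one decision tree per bootstrap sample (sampling with replacement)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=1).fit(X[idx], y[idx]))

# Each tree votes; the majority class becomes the ensemble prediction
votes = np.stack([t.predict(X) for t in trees])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```

Because each tree sees a different bootstrap sample, their errors are partially decorrelated, and the majority vote is more stable than any single tree.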

In [ ]:
# Fitting the random forest tree classifier on the training data
rf_estimator = RandomForestClassifier(random_state=1)  # Initialize with random_state for reproducibility
rf_estimator.fit(X_train, y_train)  # Fit the model to the training data
Out[ ]:
RandomForestClassifier(random_state=1)

Checking the performance of the random forest on the training dataset¶

In [ ]:
y_pred_train4 = rf_estimator.predict(X_train)  # Predict on the training data using the *trained* model
metrics_score(y_train, y_pred_train4)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3340
           1       1.00      1.00      1.00       832

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172


Checking the performance of the random forest on the test dataset¶

In [ ]:
# Checking performance on the testing data
y_pred_test4 = rf_estimator.predict(X_test)  # Predict on the test data using the *trained* model
metrics_score(y_test, y_pred_test4)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       0.92      0.95      0.94      1431
           1       0.79      0.68      0.73       357

    accuracy                           0.90      1788
   macro avg       0.85      0.82      0.83      1788
weighted avg       0.90      0.90      0.90      1788


Observations:¶

The model scored perfectly on the training data but achieved only 68% recall on the test data, indicating overfitting. Let's try tuning the model.

Random Forest Classifier Hyperparameter Tuning¶

In [ ]:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 1)

# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
    "max_depth": [6, 7],
    "min_samples_leaf": [20, 25],
    "max_features": [0.8, 0.9],
    "max_samples": [0.9, 1],
    "class_weight": ["balanced",{0: 0.3, 1: 0.7}]
             }

# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search on the training data using scorer=scorer and cv=5
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring=scorer, cv=5)

grid_obj = grid_obj.fit(X_train, y_train)

# Save the best estimator to variable rf_estimator_tuned
rf_estimator_tuned = grid_obj.best_estimator_

#Fit the best estimator to the training data
rf_estimator_tuned.fit(X_train, y_train)
Out[ ]:
RandomForestClassifier(class_weight='balanced', criterion='entropy',
                       max_depth=7, max_features=0.9, max_samples=0.9,
                       min_samples_leaf=25, n_estimators=110, random_state=1)

Check the performance of the tuned random forest on the training dataset¶

In [ ]:
# Checking performance on the training data
y_pred_train5 = rf_estimator_tuned.predict(X_train)  # Predict on the training data using the *tuned* rf_estimator
metrics_score(y_train, y_pred_train5)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       0.94      0.89      0.91      3340
           1       0.63      0.77      0.70       832

    accuracy                           0.87      4172
   macro avg       0.79      0.83      0.81      4172
weighted avg       0.88      0.87      0.87      4172


Check the performance of the tuned random forest on the test dataset¶

In [ ]:
# Checking performance on the test data
y_pred_test5 = rf_estimator_tuned.predict(X_test)  # Predict on the test data using the *tuned* model
metrics_score(y_test, y_pred_test5)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       0.93      0.89      0.91      1431
           1       0.63      0.74      0.68       357

    accuracy                           0.86      1788
   macro avg       0.78      0.82      0.80      1788
weighted avg       0.87      0.86      0.87      1788


Observations:

  • The tuned random forest performed better on the test data than the untuned model (recall up to 74% from 68%), and its test recall is close to its training recall (77%). This model will do a better job of predicting defaulters.
In [ ]:
importances = rf_estimator_tuned.feature_importances_

indices = np.argsort(importances)

feature_names = list(X.columns)

plt.figure(figsize = (12, 12))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [feature_names[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()

Key Observations

  • The tuned Random Forest model performed well. The recall score on the test set increased from 68% to 74%.
  • There were several fields with strong feature importance, with DEBTINC_values_missing_flag and DEBTINC being the top two.

The next step is to tune the random forest with class weights¶

In [ ]:
# Define the base model (Random Forest)
rf_estimator_weights = RandomForestClassifier(random_state=1)

# Define the (much smaller) parameter grid for GridSearchCV
param_grid_weights = {
    "n_estimators": [80, 160],
    "max_depth": [None, 6],
    "min_samples_leaf": [10, 30],
    "min_samples_split": [2, 5],
    "max_features": [0.6, 1.0],
    "max_samples": [0.7, 0.9],
    "criterion": ["gini", "entropy"],
    "class_weight": [
        None,
        "balanced",
        {0: 0.3, 1: 0.7},
    ],
}

# Define the scorer (recall for class 1)
scorer = metrics.make_scorer(recall_score, pos_label=1)

# Run GridSearchCV
grid_search_weights = GridSearchCV(
    rf_estimator_weights,
    param_grid=param_grid_weights,
    scoring=scorer,
    cv=3,  # Reduced cross-validation folds
    n_jobs=-1,  # Use all available cores
)

grid_search_weights.fit(X_train, y_train)

# Get the best estimator
rf_estimator_tuned_weights = grid_search_weights.best_estimator_

# Print the best parameters and best score
print("Best parameters found (with weights):", grid_search_weights.best_params_)
print("Best recall score (with weights):", grid_search_weights.best_score_)

# Retrain on the full training set
rf_estimator_tuned_weights.fit(X_train, y_train)
Best parameters found (with weights): {'class_weight': 'balanced', 'criterion': 'gini', 'max_depth': None, 'max_features': 1.0, 'max_samples': 0.9, 'min_samples_leaf': 10, 'min_samples_split': 2, 'n_estimators': 160}
Best recall score (with weights): 0.7656329809798371
Out[ ]:
RandomForestClassifier(class_weight='balanced', max_features=1.0,
                       max_samples=0.9, min_samples_leaf=10, n_estimators=160,
                       random_state=1)

Checking performance of the tuned random forest with weights on the training dataset¶

In [ ]:
# Checking performance on the training data
y_pred_train6 = rf_estimator_tuned_weights.predict(X_train) # Predict on the training data using the *tuned* model
metrics_score(y_train, y_pred_train6)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       0.97      0.89      0.93      3340
           1       0.67      0.87      0.76       832

    accuracy                           0.89      4172
   macro avg       0.82      0.88      0.84      4172
weighted avg       0.91      0.89      0.89      4172


Checking performance of the tuned random forest with weights on the test dataset¶

In [ ]:
# Checking performance on the test data
y_pred_test6 = rf_estimator_tuned_weights.predict(X_test) # Predict on the test data using the *tuned* rf_estimator
metrics_score(y_test, y_pred_test6)         # Evaluate using your metrics_score function
              precision    recall  f1-score   support

           0       0.94      0.89      0.92      1431
           1       0.64      0.78      0.71       357

    accuracy                           0.87      1788
   macro avg       0.79      0.84      0.81      1788
weighted avg       0.88      0.87      0.87      1788


Plot the feature importances of the tuned random forest with weights¶

In [ ]:
importances = rf_estimator_tuned_weights.feature_importances_

indices = np.argsort(importances)

feature_names = list(X.columns)

plt.figure(figsize = (12, 12))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [feature_names[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()

Feature Importance:

  • The features with the strongest importance are DEBTINC_values_missing_flag and DEBTINC, followed by CLAGE.
In [ ]:
def get_recall_score(model,flag=True,X_train=X_train,X_test=X_test):
    '''
    model : classifier to predict values of X

    '''
    a = [] # defining an empty list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_recall = metrics.recall_score(y_train,pred_train)
    test_recall = metrics.recall_score(y_test,pred_test)
    a.append(train_recall) # adding train recall to list
    a.append(test_recall) # adding test recall to list

    # If flag is True, print the train and test recall. The default is True.
    if flag:
        print("Recall on training set : ", train_recall)
        print("Recall on test set : ", test_recall)

    return a # returning the list with train and test scores
In [ ]:
#  Function to calculate precision score
def get_precision_score(model,flag=True,X_train=X_train,X_test=X_test):
    '''
    model : classifier to predict values of X

    '''
    b = []  # defining an empty list to store train and test results
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    train_precision = metrics.precision_score(y_train,pred_train)
    test_precision = metrics.precision_score(y_test,pred_test)
    b.append(train_precision) # adding train precision to list
    b.append(test_precision) # adding test precision to list

    # If flag is True, print the train and test precision. The default is True.
    if flag:
        print("Precision on training set : ", train_precision)
        print("Precision on test set : ", test_precision)

    return b # returning the list with train and test scores
In [ ]:
def get_accuracy_score(model,flag=True,X_train=X_train,X_test=X_test):
    '''
    model : classifier to predict values of X

    '''
    c = [] # defining an empty list to store train and test results
    train_acc = model.score(X_train,y_train)
    test_acc = model.score(X_test,y_test)
    c.append(train_acc) # adding train accuracy to list
    c.append(test_acc) # adding test accuracy to list

    # If flag is True, print the train and test accuracy. The default is True.
    if flag:
        print("Accuracy on training set : ", train_acc)
        print("Accuracy on test set : ", test_acc)

    return c # returning the list with train and test scores
In [ ]:
# Create a dictionary to store your models
models = {
    "Logistic Regression": model,
    "Default Decision Tree": d_tree,
    "Tuned Decision Tree": d_tree_tuned,
    "Tuned Decision Tree with Pruning": d_tree_tuned_with_pruning,
    "Default Random Forest": rf_estimator,
    "Tuned Random Forest": rf_estimator_tuned,
    "Tuned Random Forest with weights": rf_estimator_tuned_weights,
}


# Define a function to calculate and print metrics for a given model
def get_metrics(model, model_name, X_train, y_train, X_test, y_test):
    """
    Calculates and prints accuracy, recall, and precision for a given model on both train and test data.

    Args:
        model: The trained machine learning model.
        model_name: The name of the model (string).
        X_train: Training data features.
        y_train: Training data labels.
        X_test: Test data features.
        y_test: Test data labels.

    Returns:
        A dictionary containing the metrics for train and test data.
    """
    # Predict on train data
    y_train_pred = model.predict(X_train)
    # Predict on test data
    y_test_pred = model.predict(X_test)

    # Calculate and store train metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    train_recall = recall_score(y_train, y_train_pred)
    train_precision = precision_score(y_train, y_train_pred)

    # Calculate and store test metrics
    test_accuracy = accuracy_score(y_test, y_test_pred)
    test_recall = recall_score(y_test, y_test_pred)
    test_precision = precision_score(y_test, y_test_pred)


    # Return results
    return {
        "Model": model_name,
        "Train Accuracy": train_accuracy,
        "Test Accuracy": test_accuracy,
        "Train Recall": train_recall,
        "Test Recall": test_recall,
        "Train Precision": train_precision,
        "Test Precision": test_precision,
    }

# defining an empty list to store the results for each model
results = []

# looping through all the models to get the accuracy,recall and precision scores
for model_name, model in models.items():
  if model_name == "Logistic Regression":
    results.append(get_metrics(model, model_name, X_train_scaled, y_train, X_test_scaled, y_test))
  else:
    results.append(get_metrics(model, model_name, X_train, y_train, X_test, y_test))

# Convert results to a DataFrame for easier comparison
results_df = pd.DataFrame(results)
In [ ]:
results_df = results_df[["Model", "Train Accuracy", "Test Accuracy", "Train Recall", "Test Recall", "Train Precision", "Test Precision"]]
# Display the DataFrame using style
print("\n------------- Model Comparison Summary")
display(results_df.style.background_gradient(axis=0))
------------- Model Comparison Summary
   Model                             Train Accuracy  Test Accuracy  Train Recall  Test Recall  Train Precision  Test Precision
0  Logistic Regression                     0.872483       0.878076      0.606971     0.591036         0.711268        0.745583
1  Default Decision Tree                   1.000000       0.859060      1.000000     0.591036         1.000000        0.665615
2  Tuned Decision Tree                     0.853308       0.820470      0.590144     0.476190         0.644357        0.559211
3  Tuned Decision Tree with Pruning        0.831256       0.833333      0.710337     0.691877         0.560721        0.567816
4  Default Random Forest                   1.000000       0.899329      1.000000     0.680672         1.000000        0.786408
5  Tuned Random Forest                     0.865772       0.862416      0.774038     0.739496         0.633858        0.633094
6  Tuned Random Forest with weights        0.888782       0.870246      0.871394     0.784314         0.670055        0.643678

Comparison of Techniques and Their Relative Performance¶

  • What are the key measures of success? The key measure of success is minimizing financial losses due to defaults while not being overly restrictive in approving loans, which would impact revenue and customer satisfaction.

  • What are the important metrics to consider and why?

    • Recall: High recall minimizes the risk of missing bad loans, even if it means a higher rate of false positives (incorrectly flagging loans that would not have defaulted). This is important because failing to identify a default is often very costly.
    • Precision: High precision reduces the number of false positives. This is important because it reduces the operational costs of dealing with loans that were incorrectly flagged.
  • How do different techniques perform?

    • The Logistic Regression model achieved reasonable accuracy and precision but low recall for defaulters, suggesting it's not suitable for this problem.
    • The Decision Tree models, both default and tuned, showed signs of overfitting, with high training accuracy but lower test accuracy. Pruning helped to mitigate this issue.
    • The Random Forest models demonstrated better generalization, with the tuned version with weights achieving the best balance of accuracy, recall, and precision on the test dataset.
  • Which one is performing relatively better and why? The Tuned Random Forest with Weights appears to be the best performer as it provides the most balanced performance across accuracy, recall, and precision on the held-out test data.

  • Is there scope to improve the performance further?

    • Feature Engineering: Creating new features from existing ones might improve the model's ability to capture complex relationships.
    • Hyperparameter Tuning: Further tuning of the Random Forest model's hyperparameters could yield better results.
    • Ensemble Methods: Combining multiple models might improve predictive accuracy.
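As a sketch of the ensemble idea above, scikit-learn's VotingClassifier can combine the three model families used in this project via soft voting. This is illustrative, not part of the project pipeline; synthetic imbalanced data stands in for HMEQ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data stands in for HMEQ
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=1)

# Soft voting averages the predicted class probabilities of each model
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=4, random_state=1)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=1)),
    ],
    voting="soft",
)
scores = cross_val_score(ensemble, X, y, cv=3, scoring="recall")
```

Soft voting tends to help when the component models make different kinds of errors, as the Decision Tree and Logistic Regression models do here.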

Variable Significance¶

  • Which variables are significant in predicting the target variable? Based on correlation, DEBTINC, DEROG, DELINQ, VALUE, YOJ, and CLAGE appear to be the most significant variables in predicting loan default. These are indicators of an applicant's financial stability and credit history.

  • Do the variables continue to be significant post-modelling?

  • DEBTINC: Consistently appears as the most critical variable across all models. This reinforces its importance in predicting loan default. It had the highest correlation with the target variable pre-modelling, and it is the most important variable in the tuned Random Forest model.

  • VALUE: Property value is a consistent predictor, showing up as important in all the models. It had a strong correlation with the target variable in the pre-modelling stage.

  • DEROG and DELINQ: Both continue to be significant after modeling, particularly in the Logistic Regression model. They were both correlated with the target variable in the pre-modelling phase.

  • CLAGE: The age of the oldest credit line continues to be significant. This was identified in the pre-modelling phase as well.

  • YOJ: Years at the current job are significant. This was also identified in the pre-modelling phase.

  • Missing Value Flags: These variables are very important in all models, indicating that missing data itself is associated with a higher likelihood of default.

Meaningful Insights¶

  • What are the most meaningful insights from the data relevant to the problem?
    • The strong influence of missing values suggests that the process of data collection might be an indicator of risk. We should investigate why certain applicants have missing data and whether this is correlated with other factors.
    • The importance of credit history and debt-to-income ratio confirms our understanding of loan risk. This highlights the need for careful assessment of these factors during loan approval.
    • The fact that certain job categories are associated with higher risk suggests that we might need more tailored underwriting criteria for applicants in those professions.

1. Is the Model Performance Good Enough for Deployment in Production?¶

We trained and evaluated several models, with the following key results:

  • Logistic Regression:
    • Good at identifying non-defaulters.
    • Not as effective at identifying defaulters.
    • Recall on test set: 59.1%
    • Precision: 74.6%
    • Accuracy: 87.8%
  • Decision Tree (Default):
    • Overfit significantly, showing poor generalization.
    • Recall on test set: 59.1%
    • Precision: 66.6%
    • Accuracy: 85.9%
  • Decision Tree (Tuned):
    • Improved recall, but still not ideal.
    • Recall on test set: 47.6%
    • Precision: 55.9%
    • Accuracy: 82.0%
  • Decision Tree (Tuned with Pruning):
    • Recall on test set: 69.2%
    • Precision: 56.8%
    • Accuracy: 83.3%
  • Random Forest (Default):
    • Recall on test set: 68.1%
    • Precision: 78.6%
    • Accuracy: 89.9%
  • Random Forest (Tuned):
    • Recall: 73.9%
    • Precision: 63.3%
    • Accuracy: 86.2%
  • Random Forest (Tuned with Weights):
    • Recall: 78.4%
    • Precision: 64.4%
    • Accuracy: 87.0%

Model Readiness Assessment:

  • Random Forest (Tuned with Weights): This model offers the best balance of recall, precision, and accuracy among the models tested. Its recall (78.4%) is the highest of any model, and it maintains competitive precision (64.4%) and overall accuracy (87.0%).

Generalization: The cross-validation process helped improve the model's ability to generalize.

Data: The dataset was cleaned and pre-processed to improve its quality.

Conclusion: Based on the above analysis, the Random Forest with Weights is the best-performing model and is suitable for deployment, assuming the business context favors balanced performance. While it falls just short of the 80% recall target, it is the most promising candidate overall.

2. Is the Model Interpretable?¶

The tuned Random Forest model, particularly the weighted version, provides sufficient interpretability through feature importance and correlation analysis. This enables the lending institution to understand the major factors driving loan default predictions, such as debt-to-income ratio, derogatory reports, and credit history. We used GridSearchCV to perform hyperparameter tuning, optimizing the model for recall, which aligns with our project's objective of minimizing false negatives. This ensures we correctly identify as many potential defaults as possible, even if it results in some false positives.

3. What Model Do You Propose to Be Adopted?¶

Proposed Model: The Random Forest with Weights is the proposed model for adoption.

4. Why Is This the Best Solution to Adopt?¶

  • Balanced Performance: The Random Forest with weights offers the best balance of recall, precision, and accuracy among the tested models.
  • High Overall Performance: It has high precision and accuracy, in addition to good recall.
  • Good Generalization: The use of multiple trees and hyperparameter tuning improved its ability to generalize.
  • Feature Importance: The model provides feature importance scores, allowing us to understand which factors are driving the predictions, and it aligns with the pre-modelling correlation analysis.
  • Robustness: Random Forests are known to be robust to outliers and noise in the data.
  • Predictors: The model uses the variables that have been determined to be significant in detecting defaults.

5. How Does This Solve the Problem?¶

The Random Forest with weights aims to solve the problem of accurately identifying loan applications with a high likelihood of default. The model and feature importance plots provide data-driven insights into the factors that influence default, enabling more informed decision-making. The model uses the variables that were the most significant in predicting the target variable.

The model's effectiveness is further demonstrated by its performance metrics:

  • Significant Default Rate Reduction: The model reduces the default rate from a baseline of 20% to an estimated 13%, a 7-percentage-point decrease.
  • Improved Default Prediction: It correctly identifies 280 defaults, reducing missed defaults (false negatives) to 77.
  • Potential Loss Reduction: This translates to a potential loss reduction of $7.76 million, highlighting the model's practical impact on profitability.
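The default counts quoted above follow directly from the test-set metrics (357 actual defaulters, 78.4% recall for the chosen model); the dollar figure additionally depends on an assumed average loss per default that is not restated here:

```python
# Reproducing the default counts quoted above from the test-set metrics
actual_defaulters = 357        # class-1 support in the test set
recall = 0.784                 # tuned Random Forest with weights, class 1

caught = round(actual_defaulters * recall)   # true positives: 280
missed = actual_defaulters - caught          # false negatives: 77
```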

Risks and Challenges¶

While the proposed loan default prediction model offers significant benefits, it's crucial to acknowledge potential risks and challenges associated with its implementation:

1. Data Drift:

  • Description: The model's performance may degrade over time if the underlying data distribution changes (e.g., due to economic shifts or changes in lending practices).
  • Mitigation: Implement ongoing monitoring of model performance and retrain the model periodically with updated data to address data drift.
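One common drift monitor in credit scoring is the Population Stability Index (PSI). A minimal sketch, assuming a numeric feature such as DEBTINC and the usual rule of thumb that PSI above 0.25 signals a major shift (the feature values below are simulated, not HMEQ data):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline sample and new data.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(1)
baseline = rng.normal(35, 8, size=5000)  # e.g. DEBTINC at training time
drifted = rng.normal(40, 8, size=5000)   # the same feature after an economic shift
```

Computing PSI per feature on each new batch of applications gives an early warning that retraining may be needed before model performance visibly degrades.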

2. Model Bias:

  • Description: The model could inadvertently perpetuate existing biases present in the training data, leading to unfair or discriminatory outcomes.
  • Mitigation: Carefully select and preprocess data to minimize bias, use fairness-aware algorithms, and regularly audit model predictions for potential bias.

3. Data Security and Privacy:

  • Description: The model relies on sensitive borrower data, raising concerns about data security and privacy breaches.
  • Mitigation: Implement robust data security measures, ensure compliance with relevant privacy regulations, and anonymize or pseudonymize data whenever possible.

4. Implementation and Maintenance Costs:

  • Description: Developing, deploying, and maintaining the model requires technical expertise, infrastructure, and ongoing resources, which can incur significant costs.
  • Mitigation: Carefully plan the implementation process, leverage cloud-based solutions to reduce infrastructure costs, and establish clear ownership and responsibilities for model maintenance.

5. Regulatory Compliance:

  • Description: The model must comply with relevant lending regulations and guidelines, which can vary across jurisdictions.
  • Mitigation: Ensure the model's development and deployment adhere to applicable regulations, consult with legal and compliance experts, and maintain documentation of compliance efforts.

Addressing these risks and challenges proactively is crucial for the successful implementation and long-term effectiveness of the loan default prediction model.

Next Steps:¶

This project has successfully developed a promising loan default prediction model. To further enhance its performance and ensure its practical utility, the following next steps are recommended:

1. Further Tune Decision Thresholds:

  • Description: Continue adjusting probability thresholds to optimize the trade-off between precision and recall. This involves finding the threshold that best balances the need to identify true defaulters (recall) while minimizing false alarms (precision).
  • Rationale: Fine-tuning the decision threshold can significantly impact the model's operational effectiveness in a real-world setting.
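A minimal sketch of threshold tuning: instead of the default 0.5 cutoff, score applicants with predict_proba and sweep the cutoff, trading precision for recall (synthetic imbalanced data stands in for HMEQ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data stands in for the HMEQ train/test split
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = RandomForestClassifier(class_weight="balanced", random_state=1).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # estimated probability of default

# Lowering the cutoff flags more applicants as risky: recall rises, precision falls
for threshold in (0.5, 0.4, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: recall={recall_score(y_test, pred):.2f}, "
          f"precision={precision_score(y_test, pred):.2f}")
```

Because lowering the threshold can only add predicted positives, recall is non-decreasing as the cutoff falls; the business picks the point where the precision cost becomes unacceptable.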

2. Feature Engineering:

  • Description: Investigate adding or transforming features to improve model interpretability and potentially enhance its predictive power. This could involve creating new variables from existing ones or exploring interactions between features.
  • Rationale: New features may capture hidden patterns in the data that the current model is missing, leading to better performance.

3. Deploy & Monitor Model:

  • Description: If used in production, set up a system to continuously monitor the model's performance and detect potential concept drift. This involves tracking key metrics over time and retraining the model periodically with updated data to maintain accuracy.
  • Rationale: Monitoring is crucial to ensure the model remains accurate and relevant as the underlying data distribution may change over time.

4. Compare Against Alternative Models:

  • Description: Consider exploring alternative models, such as gradient boosting (e.g., XGBoost, LightGBM) or deep learning approaches, to see if they can further improve performance, especially if higher recall is desired.
  • Rationale: Different models have varying strengths and weaknesses, and exploring alternatives can help identify the best model for this specific task.

5. Refine Hyperparameter Tuning:

  • Description: Revisit the hyperparameter tuning process for the chosen model to see if further optimization can yield noticeable improvements in performance. This involves exploring a wider range of hyperparameter values or using more advanced tuning techniques.
  • Rationale: Fine-tuning hyperparameters can often lead to small but significant improvements in model accuracy.

6. Explore Cost-Sensitive Learning:

  • Description: Given the potential imbalance in the dataset and the varying costs associated with different types of errors (false positives vs. false negatives), explore cost-sensitive learning techniques. This involves assigning different misclassification costs during model training to optimize the model for the specific business context.
  • Rationale: Cost-sensitive learning can improve the model's decision-making by prioritizing the predictions that have the most significant financial impact.
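A minimal cost-sensitive sketch using per-sample weights, where the 5x cost ratio is an illustrative assumption rather than a figure from this project (synthetic data stands in for HMEQ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data stands in for the HMEQ train/test split
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Illustrative cost ratio: treat a missed defaulter as 5x as costly
sample_weight = np.where(y_train == 1, 5.0, 1.0)

plain = RandomForestClassifier(random_state=1).fit(X_train, y_train)
weighted = RandomForestClassifier(random_state=1).fit(
    X_train, y_train, sample_weight=sample_weight
)

r_plain = recall_score(y_test, plain.predict(X_test))
r_weighted = recall_score(y_test, weighted.predict(X_test))
```

In practice the weight ratio would be derived from the institution's estimated loss per missed default versus the cost of a false alarm.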

7. Error Analysis:

  • Description: Analyze instances where the model made incorrect predictions (false positives and false negatives) to identify potential patterns or biases. This involves examining the characteristics of misclassified cases to understand why the model made mistakes and how to improve its performance.
  • Rationale: Error analysis can provide valuable insights for feature engineering, model selection, and overall model refinement.

8. Model Explainability and Interpretability:

  • Description: Implement techniques such as SHAP values or LIME to explain the model's predictions and understand the factors influencing its decisions. This can increase transparency and build trust in the model's recommendations.
  • Rationale: Explainable models are easier for stakeholders to understand and accept, especially in sensitive domains like lending.
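SHAP and LIME are third-party libraries not used in this notebook; as a lighter-weight, model-agnostic starting point, scikit-learn's permutation importance measures how much shuffling each feature degrades recall on held-out data (synthetic data stands in for HMEQ):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the HMEQ train/test split
X, y = make_classification(n_samples=600, n_informative=4,
                           weights=[0.8, 0.2], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in recall
result = permutation_importance(clf, X_test, y_test, scoring="recall",
                                n_repeats=10, random_state=1)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
```

Unlike the impurity-based importances plotted earlier, permutation importance is computed on held-out data, so it is less biased toward high-cardinality features and reflects the metric the business actually cares about.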